Skip to content

[Data] - TPCH Q1 Release Test - Expr#58331

Merged
alexeykudinkin merged 5 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/fix_tpch_q1
Oct 31, 2025
Merged

[Data] - TPCH Q1 Release Test - Expr#58331
alexeykudinkin merged 5 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/fix_tpch_q1

Conversation

@goutamvenkat-anyscale
Copy link
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Oct 31, 2025

Description

Replace map_batches and numpy invocations with with_column and arrow kernels

Release test: https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a great improvement, replacing map_batches with with_column and Arrow expressions for the TPC-H Q1 benchmark. This change enhances both performance and readability, making the implementation much cleaner. I've identified a couple of opportunities to further improve the code by reusing newly created float columns, which will help avoid redundant computations and increase clarity.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels Oct 31, 2025
Signed-off-by: Goutam <goutam@anyscale.com>

# Build float views + derived columns
ds = (
ds.with_column("l_quantity_f", to_f64(col("column04")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to rename the columns first as our dataset has bogus column names

Signed-off-by: Goutam <goutam@anyscale.com>
Comment on lines +39 to +57
.rename_columns(
{
"column00": "l_orderkey",
"column02": "l_suppkey",
"column03": "l_linenumber",
"column04": "l_quantity",
"column05": "l_extendedprice",
"column06": "l_discount",
"column07": "l_tax",
"column08": "l_returnflag",
"column09": "l_linestatus",
"column10": "l_shipdate",
"column11": "l_commitdate",
"column12": "l_receiptdate",
"column13": "l_shipinstruct",
"column14": "l_shipmode",
"column15": "l_comment",
}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iamjustinhsu
Copy link
Contributor

@goutamvenkat-anyscale how were u able to get around the empty partition?

@alexeykudinkin alexeykudinkin merged commit 66c857c into ray-project:master Oct 31, 2025
6 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
## Description
Replace `map_batches` and numpy invocations with `with_column` and arrow
kernels

Release test:
https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
## Description
Replace `map_batches` and numpy invocations with `with_column` and arrow
kernels

Release test:
https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
## Description
Replace `map_batches` and numpy invocations with `with_column` and arrow
kernels

Release test:
https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
## Description
Replace `map_batches` and numpy invocations with `with_column` and arrow
kernels

Release test:
https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
## Description
Replace `map_batches` and numpy invocations with `with_column` and arrow
kernels

Release test:
https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants